Apache Spark is an open-source engine for large-scale
parallel data processing known for its speed, ease of use, and
cutting-edge analytics.
Azure Databricks is an analytics platform based on
Microsoft Azure cloud services, providing the latest versions of Apache
Spark and allowing consistent integration with open source libraries.
Built with Spark capabilities, Azure Databricks provides a cloud
platform with interactive workspaces and fully managed Spark
clusters.
Twitter (Elon Musk 2015-2022):
Dataset of Elon Musk’s most recent Tweets during 2015-2022, stored in
JSON format, where each tweet is in its own separate JSON object. All
Tweets are collected, parsed, and plotted using rtweet in
R. In total, there are more than thousands of tweets in this dataset,
including retweets and replies. All objects are to go into a single
database.
| 1 | status_id | 14 | hashtags | 27 | quoted_followers_count | 40 | retweet_location |
| 2 | created_at | 15 | symbols | 28 | quoted_location | 41 | retweet_description |
| 3 | user_id | 16 | media_expanded_url | 29 | quoted_description | 42 | retweet_verified |
| 4 | screen_name | 17 | media_type | 30 | quoted_verified | 43 | name |
| 5 | text | 18 | mentions_screen_name | 31 | retweet_status_id | 44 | location |
| 6 | source | 19 | quoted_status_id | 32 | retweet_text | 45 | description |
| 7 | reply_to_screen_name | 20 | quoted_text | 33 | retweet_created_at | 46 | followers_count |
| 8 | is_quote | 21 | quoted_created_at | 34 | retweet_source | 47 | friends_count |
| 9 | is_retweet | 22 | quoted_source | 35 | retweet_favorite_count | 48 | account_created_at |
| 10 | favorite_count | 23 | quoted_favorite_count | 36 | retweet_retweet_count | 49 | verified |
| 11 | retweet_count | 24 | quoted_retweet_count | 37 | retweet_user_id | ||
| 12 | quote_count | 25 | quoted_user_id | 38 | retweet_screen_name | ||
| 13 | reply_count | 26 | quoted_screen_name | 39 | retweet_followers_count |
Once you have your twitter app setup and loaded the
rtweet package in R, you can run the following command to
generate a token for the Twitter data. The first thing that you need to
setup in your code is your authentication and then create a token that
authenticates access to tweets.
library(rtweet)
twitter_token <- create_token(
app = "AppName",
consumer_key = < api_key > ,
consumer_secret = < api_key_secret > ,
access_token = < access_token > ,
access_secret = < access_token_secret >
)
use_oauth_token(twitter_token)
Now we search Twitter’s full archive API. We run the
search_fullarchive command in order to be able to access
historical tweets from a user. The example below captures all of Elon
Musk’s tweets from January 01, 2010 to May 28, 2022.
users <- "elonmusk"
musktweets <- search_fullarchive(
paste0("from:", users),
n = < number of tweets >,
env_name = < enviroment-name >,
fromDate = < YYMMDDHHMM >,
toDate = < YYMMDDHHMM >
)
Data Preview:
| created_at | screen_name | text | source | favorite_count | retweet_count | quote_count | reply_count | hashtags | media_type | is_quote | is_retweet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-05-27 22:52:53 | elonmusk | Starship 24 rolls out to the pad at Starbase https://t.co/PGh6FY6x8w | Twitter for iPhone | 0 | 0 | 0 | 0 | NA | photo | FALSE | TRUE |
| 2022-05-27 22:16:31 | elonmusk | (westcoastbill?) It was 96% in favor of me, until I commented. Ironically, there might be more left-wing bots following me than right-wing haha. | Twitter for iPhone | 23603 | 1177 | 83 | 1466 | NA | NA | FALSE | FALSE |
| 2022-05-27 22:03:30 | elonmusk | (IheartTesla?) (Tesla?) V true | Twitter for iPhone | 1869 | 92 | 7 | 152 | NA | NA | FALSE | FALSE |
| 2022-05-27 20:59:20 | elonmusk | (EvaFoxU?) Yes | Twitter for iPhone | 3014 | 142 | 9 | 236 | NA | NA | FALSE | FALSE |
| 2022-05-27 20:56:05 | elonmusk | Great potential to lift people out of poverty. Providing Internet is teaching people to fish. | Twitter for iPhone | 140136 | 11811 | 912 | 6232 | NA | NA | FALSE | FALSE |
| 2022-05-27 20:54:04 | elonmusk | One Starlink can provide Internet for an entire school of hundreds of students | Twitter for iPhone | 202116 | 16217 | 1316 | 21341 | NA | NA | FALSE | FALSE |
(path finding) Display the thread (replies) of tweets
(the tweet, time, id, in reply to id, user name with their screen name)
posted by Elon Musk with screen_name in the order in which they were
posted.(location) From which location have the tweets been
most actively posted (most number of tweets)?(hashtags) Which hashtags does Musk use the most, and
how many tweets are associated with these hashtags?(topics) What word does Musk mention the most in his
tweets? What company products does Musk mention the most in his tweets?
Products include Falcon 9, Starlink Satellites, Model 3 cars, etc.(trending) Are there any trends of what Musk tweets
about the company?(nature of engagement) What is the percentage of
different types of tweets (simple tweet, reply, retweet, quoted tweet)
to their overall number of tweets?Building a Databricks workspace using an Apache Spark cluster.
First, we create a Databricks workspace from the Azure portal and
then launch the workspace, which redirects us to the interactive
Databricks portal. We create a Spark cluster from the Databricks
interactive workspace and configure a notebook on the cluster. In the
notebook, we use PySpark to read data from a dataset into a
Spark DataFrame. Using the Spark DataFrame, we can run a Spark SQL job
to query the data.
The Databricks File System (DBFS) is a distributed
system that scales to an Azure Databricks workspace and builds on a
Databricks cluster. FileStore is a specific folder in DBFS where users
can import and store data files from their local computer into a DBFS
web browser. The Spark API can then read the imported data into an
Apache Spark DataFrame. For example, we can import a CSV file from our
local machine into a table using the Create Table UI in the Azure
Databricks workspace portal.
twitter_df = spark.read.csv(path = 'dbfs:/FileStore/dfclean.csv',
header = "true", multiLine = "true")
# Register table so it is accessible via SQL Context
twitter_df.createOrReplaceTempView('twitter_df')
The Azure Databricks system stores this data file in FileStore,
located in the /FileStore/tables folder. We can then use the following
PySpark command to read this data.
Users can perform relational procedures on DataFrames using a
domain-specific language (DSL) similar to R dataframes and Python
Pandas. DataFrames support standard relational operators, including
projection (select), filter (where),
join, and aggregations (group by).
| user_id | screen_name | name | description | followers_count | friends_count | favourites_count | account_created_at | verified |
|---|---|---|---|---|---|---|---|---|
| x44196397 | elonmusk | Elon Musk | NA | 95589999 | 114 | 13386 | 2009-06-02 20:12:29 | TRUE |
(mentions)Display the thread (mentions) of tweets (the tweet, time, id, in mention to id, user name with their screen name) posted by Elon Musk with screen_name in the order in which they were posted.
from pyspark.sql.functions import col
mentionDF = spark.sql(
"SELECT screen_name, \
created_at, \
text, \
mentions_user_id, \
mentions_screen_name \
FROM twitter_df"
)
mentionDF.filter(col("mentions_user_id") != "NA") \
.write \
.format('csv') \
.mode('overwrite') \
.option("overwriteSchema", "true") \
.saveAsTable("mentionDF")
mentionDF.createOrReplaceTempView('mentionDF')
display(spark.read.table("mentionDF"))
| screen_name | text | created_at | mentions_user_id | mentions_screen_name |
|---|---|---|---|---|
| elonmusk | Yeah, this really is me, as my Mom (mayemusk?) will attest. Not sure I can handle just doing 140 char missives. Will put longer thoughts on G+ | 2011-12-21 11:13:59 | x14171401 | mayemusk |
| elonmusk | Vending Machine Attendant Admits B3 Selection Has Changed A Lot Over The Years http://t.co/nccSGzCQ #OnionInnovation | 2011-12-27 19:54:15 | x14075928 | TheOnion |
| elonmusk | Any Idiot Could Have Come Up With The Car http://t.co/e9cLgfEg #OnionInnovation | 2011-12-28 22:29:48 | x14075928 | TheOnion |
| elonmusk | (TheOnion?) So true :) | 2011-12-28 22:27:08 | x14075928 | TheOnion |
SQL:
SELECT mentions_screen_name,
COUNT(*) AS n
FROM mentionDF
WHERE mentions_screen_name != 'NA'
GROUP BY mentions_screen_name
SORT BY n DESC;
(types)What is the percentage of different types of tweets (simple tweet, reply, retweet, quoted tweet) to their overall number of tweets?
When you create a managed table, Spark will manage both
the table data and the metadata. In particular data is written in the
default Hive warehouse. We want to create a table for each tweet type.
Below we visualize the different tweet types and frequency from Elon
Musk.
CREATE TABLE tweet_type (
status_id INT auto_increment PRIMARY KEY,
tweet_variant VARCHAR(50),
user_name VARCHAR(50),
created_at DATE,
tweet_text TEXT
);
So, the different tweet types are retweet,
quoted, reply_to, and normal.
Now, let’s create a table based on Elon Musk’s retweets.
CREATE TABLE data_retweets (
retweet_status_id CHAR,
retweet_text TEXT,
retweet_created_at DATE,
retweet_source TEXT,
retweet_favorite_count INT,
retweet_retweet_count INT,
retweet_user_id CHAR,
retweet_screen_name TEXT,
retweet_name TEXT,
retweet_followers_count INT,
retweet_friends_count INT,
retweet_statuses_count INT,
retweet_location TEXT,
retweet_description TEXT,
retweet_verified BOOLEAN
);
| retweet_status_id | retweet_text | retweet_created_at | retweet_source | retweet_favorite_count | retweet_retweet_count | retweet_user_id | retweet_screen_name | retweet_name | retweet_followers_count | retweet_friends_count | retweet_statuses_count | retweet_location | retweet_description | retweet_verified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x151742906267467777 | Vending Machine Attendant Admits B3 Selection Has Changed A Lot Over The Years http://t.co/nccSGzCQ #OnionInnovation | 2011-12-27 19:15:02 | Hootsuite | 22 | 120 | x14075928 | TheOnion | The Onion | 11836167 | 7 | 89149 | NA | America’s Finest News Source. | TRUE |
| x152151847614943233 | Any Idiot Could Have Come Up With The Car http://t.co/e9cLgfEg #OnionInnovation | 2011-12-28 22:20:01 | Hootsuite | 32 | 51 | x14075928 | TheOnion | The Onion | 11836167 | 7 | 89149 | NA | America’s Finest News Source. | TRUE |
SELECT retweet_screen_name,
COUNT(status_id) AS cnt
FROM data_retweets
GROUP BY retweet_screen_name
SORT BY cnt DESC;
(tweet trends)Are there any trends of when Elon Musk tweets?
The following SQL query parses the created_at datetime
column to display the year, month, day, and hour for each tweet.
from pyspark.sql.types import DateType
twitter_df.withColumn("created_at", twitter_df.created_at.cast(DateType()))
SELECT WEEKDAY(created_at) as created_weekday,
COUNT(*) as n
FROM twitter_df
GROUP BY created_weekday
ORDER BY created_weekday DESC
(topics)What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets?
import pyspark.sql.functions as f
data_df = twitter_df.select('text', 'status_id', 'created_at')
# Count and group word frequencies on text, when split by space comma
data_df.withColumn('word', f.explode(f.split(f.col('text'), ' '))) \
.groupBy('word') \
.count() \
.sort('count', ascending=False) \
.show()